Abstract

We present a Bayesian approach to adapting the parameters of a
well-trained context-dependent deep-neural-network hidden
Markov model (CD-DNN-HMM) to improve automatic speech
recognition performance. Given an abundance of DNN parameters
but only a limited amount of adaptation data, the effectiveness of
the adapted DNN model can often be compromised. We formulate
maximum a posteriori (MAP) adaptation of the parameters of
a specially designed CD-DNN-HMM with an augmented linear
hidden network connected to the output tied states, or senones,
and compare it to the previously proposed feature-space MAP
linear regression. Experimental evidence on the 20,000-word
open vocabulary Wall Street Journal task demonstrates the
feasibility of the proposed framework. In supervised adaptation,
the proposed MAP adaptation approach provides more than
10% relative error reduction and consistently outperforms
conventional transformation based methods. Furthermore, we
present an initial attempt to generate hierarchical priors that
improve adaptation efficiency and effectiveness with limited
adaptation data by exploiting similarities among senones.

Index Terms: deep neural networks, hidden Markov model, 
Bayesian adaptation, automatic speech recognition. 

1. Introduction 

Despite the outstanding results recently demonstrated by
context-dependent deep-neural-network hidden Markov models
(CD-DNN-HMMs) in various automatic speech recognition (ASR)
tasks and data sets, these acoustic models, like conventional
context-dependent Gaussian-mixture-model HMMs (CD-GMM-HMMs),
still suffer from performance degradation when training and
testing conditions are mismatched. For the standard hybrid
system using artificial neural networks (ANNs) and HMMs [5],
of which the CD-DNN-HMM is a special case, many adaptation
techniques exist. The simplest approach modifies all
weights of the connectionist architecture using some adaptation
material. Unfortunately, this leads to over-fitting on the
adaptation material when the amount of adaptation data is
limited. Recent approaches, including regularization based,
subspace based, transformation based, i-vector based, native
neural network based, and factorization based methods, as well
as fast adaptation schemes based on discriminant speaker codes,
have been proposed to circumvent the problem.

Transformation based methods are the most popular connectionist
adaptation techniques. The key idea is to augment
the structure of the ANN component by adding an affine
transformation network to the input [5], hidden, or output layer
[22]. The added parameters are typically trained while the rest
of the network parameters are kept fixed. The motivation is that
only relatively few parameters need to be learned during
adaptation, which is preferable to retraining the entire network
when the adaptation set is limited. For linear hidden network
(LHN) approaches, the last hidden layer is usually designed to
be a bottleneck to keep the number of adapted parameters
affordable [23][24][25][26].

However, adapting the parameters of a CD-DNN-HMM is
much more challenging than in earlier connectionist adaptation
schemes because of its huge parameter set, with a large
number of network branches connected to a large set of tied
HMM states, often referred to as senones. Furthermore,
DNN parameters are updated by every frame regardless
of its senone class. Therefore, the posterior probabilities of
unobserved and scarcely seen senones are often pushed towards
zero during adaptation, a phenomenon commonly referred to as
catastrophic forgetting [28]. Conservative ad hoc solutions for
ANNs have been proposed to force the senone distribution
estimated from the adapted model to stay close to that
of the unadapted model. For example, a Kullback-Leibler
divergence (KLD) based objective criterion for adaptation
was devised in [8] to alleviate the catastrophic forgetting
problem, and a variation on the standard method of assigning
the target values has also been discussed. Nonetheless,
the Bayesian solutions adopted for CD-GMM-HMMs to address
the same issue [29] have not been fully exploited.

In this study, we attempt to cast DNN adaptation within
a Bayesian framework in the spirit of maximum a posteriori
(MAP) adaptation [30]. The key goal is to re-estimate some
DNN parameters by representing the available information in an
augmented linear hidden network (LHN) added after the last
non-linear hidden layer. Experimental results on the 20,000-word
open vocabulary Wall Street Journal task demonstrate the
feasibility of the proposed approach. Under supervised
adaptation, the proposed MAP adaptation scheme provides a
relative word error rate (WER) reduction of more than 10% over an
already strong speaker-independent CD-DNN-HMM baseline
and consistently outperforms conventional transformation based
adaptation schemes. It also compares favorably against the
previously proposed feature-space maximum a posteriori linear
regression approach to speaker adaptation. We also present an
initial attempt to generate hierarchical priors for improving
adaptation efficiency with small amounts of adaptation data by
exploiting the similarities among senones.


2. Training of Deep Models 

In DNNs, the hidden layers are usually built from sigmoid
units, and the output layer is a softmax layer. The values of
the nodes can therefore be expressed as:

$$\mathbf{x}_i^t = \begin{cases} \mathbf{W}_1 \mathbf{o}^t + \mathbf{b}_1, & i = 1 \\ \mathbf{W}_i \mathbf{y}_{i-1}^t + \mathbf{b}_i, & i > 1 \end{cases} \tag{1}$$

$$\mathbf{y}_i^t = \begin{cases} \mathrm{sigmoid}(\mathbf{x}_i^t), & i < L \\ \mathrm{softmax}(\mathbf{x}_i^t), & i = L \end{cases} \tag{2}$$

where $\mathbf{W}_1$ and $\mathbf{W}_i$ are the weight matrices,
$\mathbf{b}_1$ and $\mathbf{b}_i$ are the bias vectors,
$\mathbf{o}^t$ is the input frame at time $t$, $L$ is the total
number of the hidden layers, and both the sigmoid and softmax
functions are element-wise operations. The vector $\mathbf{x}_i^t$
corresponds to the pre-nonlinearity activations, and
$\mathbf{y}_i^t$ and $\mathbf{y}_L^t$ are the vectors of
neuron outputs at the $i$-th hidden layer and the output layer,
respectively. The softmax outputs are taken as an estimate
of the senone posterior probability:

$$p(C_j \mid \mathbf{o}^t) = y_L^t(j) = \frac{\exp\left(x_L^t(j)\right)}{\sum_i \exp\left(x_L^t(i)\right)}, \tag{3}$$

where $C_j$ represents the $j$-th senone and $y_L^t(j)$ is the
$j$-th element of $\mathbf{y}_L^t$.

The DNN is trained by maximizing the log posterior probability
over the training frames, which is equivalent to minimizing
the cross-entropy objective function. Let $\mathcal{X}$ be the
whole training set, which contains $T$ frames, i.e.
$\mathbf{o}^{1:T} \in \mathcal{X}$; the loss with respect to
$\mathcal{X}$ is then given by

$$\mathcal{L}_{1:T} = -\sum_{t=1}^{T} \sum_{j=1}^{J} \tilde{p}^t(j) \log p(C_j \mid \mathbf{o}^t), \tag{4}$$

where $p(C_j \mid \mathbf{o}^t)$ is defined in Eq. (3), $J$ is the
number of senones, and $\tilde{p}^t$ is the target probability of
frame $t$. In practice, the target probability $\tilde{p}^t$ is
often obtained by forced alignment with an existing system, so
that only the entry of the aligned senone is equal to 1.
Mini-batch stochastic gradient descent (SGD), with mini-batches
sized so that all matrices fit into the GPU memory, was used to
update all network parameters during training. Pre-training was
used to initialise the DNN parameters [33].

3. Transformation Based Adaptation for
Deep Models

For DNN adaptation, some researchers add an affine
transformation network between the last hidden layer and the
output layer weight matrix, i.e., an LHN, and adapt only the
LHN parameters while keeping all other DNN parameters fixed.
To reduce the number of parameters to adapt, the last hidden
layer is usually designed to be a bottleneck (with fewer
neurons) [23][24][25][26]. This kind of LHN formulation has
obtained better results than other transformation based
adaptation schemes, such as the linear input network
(LIN) and the linear output network (LON).

The LIN approach performs adaptation by adding an augmented
linear input layer and adapting only this set of LIN parameters.
If we follow the common view that the hidden layers of
a DNN learn a more suitable data representation
and extract better "features" for the output layer, which serves
as a log-linear model, then transforming the raw input
with the LIN layer may harm the representational ability
of the hidden layers. For the LON approach, on the other hand,
the issue is that the number of neurons in the output layer
usually cannot be reduced, because we want to model the senones
directly (there can be more than 10,000 senones in practice).
This means that a huge augmented layer with even more
parameters has to be adapted.

Suppose we regard the hidden layers as a feature extractor and
the output layer as the discriminative model. The model
parameters are then the weights of the output layer's affine
transform matrix, $\mathbf{W}_L$, and the output $\mathbf{y}_L$
can be expressed as:

$$\mathbf{y}_L = \mathrm{softmax}(\mathbf{W}_L \mathbf{y}_{L-1}), \tag{5}$$

where the activation of the last hidden layer, $\mathbf{y}_{L-1}$,
serves as the new feature representation extracted by the hidden
layers. Adding an augmented LHN after the last hidden layer
is equivalent to applying a transformation matrix
$\mathbf{W}_{lhn}$ to the model parameters to obtain an adapted
model parameter set:

$$\mathbf{y}_L = \mathrm{softmax}(\mathbf{W}_L \mathbf{W}_{lhn} \mathbf{y}_{L-1}). \tag{6}$$

An LHN adaptation structure is shown in Figure 1. This
formulation is quite similar to maximum likelihood linear
regression (MLLR) [34]. The difference is that in MLLR the
model parameters are the Gaussian means and variances, whereas
here they are the weights of the log-linear model's
transformation matrix.

Figure 1: Basic neural architecture for adapting the HMM/ANN
parameters: weights associated with the links in the dashed
rectangles are estimated while all other weights remain unchanged.
The activation function of each LHN unit is linear.
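
The output layer of Eqs. (5) and (6) and the cross-entropy loss of Section 2 can be illustrated with a minimal pure-Python sketch. It assumes column-vector activations, omits biases as Eq. (5) does, and applies the LHN to the bottleneck output before the output weights; the toy sizes and all function names are hypothetical and are not taken from the authors' Kaldi setup.

```python
import math

def softmax(x):
    """Numerically stable softmax over a list of pre-activations."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, v):
    """Multiply a matrix (list of rows) by a column vector."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in W]

def identity(n):
    """n x n identity matrix, the LHN initialisation used in Section 5.2."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def lhn_posteriors(W_L, W_lhn, y_prev):
    """Senone posteriors with an LHN between the last hidden layer and W_L."""
    return softmax(matvec(W_L, matvec(W_lhn, y_prev)))

def cross_entropy(posteriors, targets):
    """Cross-entropy over frames with hard (one-hot) senone targets."""
    return -sum(math.log(p[j]) for p, j in zip(posteriors, targets))

# Toy sizes (assumed for illustration): 3-unit bottleneck, 4 senones.
W_L = [[0.2, -0.1, 0.4], [0.0, 0.3, -0.2], [0.5, 0.1, 0.1], [-0.3, 0.2, 0.0]]
y_prev = [0.7, 0.1, 0.9]

unadapted = softmax(matvec(W_L, y_prev))
adapted_at_init = lhn_posteriors(W_L, identity(3), y_prev)
loss = cross_entropy([unadapted], [2])  # one frame aligned to senone index 2
```

With an identity LHN the adapted posteriors coincide with the unadapted ones, which is why an identity initialisation gives a starting point equivalent to the speaker-independent model.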

4. MAP Adaptation for Deep Models 

Although conventional DNN adaptation approaches try to alleviate
over-fitting by reducing the number of parameters
to be adapted, that number can still be very large in some cases.
Inspired by MAP adaptation, which addresses this problem
effectively in GMM-HMM systems, we explain in this section how
to apply the MAP approach to LHN adaptation. Note that
although we choose the LHN for demonstration, the proposed MAP
approach can easily be applied to other DNN adaptation
frameworks, such as [8][14][16][17], as well.

4.1. Prior Estimation 

To establish a MAP adaptation framework as in [30],
a prior distribution over the weights of the affine transformation
network needs to be imposed. To analyze and estimate the prior
density, we used the training data of the baseline DNN. We
adopted an empirical Bayes approach [30][29]: each
speaker in the training set was treated as a sample speaker, and
supervised LHN adaptation was performed for that speaker,
yielding one speaker-specific LHN per training speaker. We
observed that the histograms of the adapted LHN weights over
speakers are approximately Gaussian, so we assume that the
weights in $\mathbf{W}_{lhn}$ follow a joint Gaussian
distribution. Expressing the LHN transformation matrix
$\mathbf{W}_{lhn}$ as a vector $\mathbf{w}$ with each entry
representing a particular weight, the prior density takes the
following form:

$$p(\mathbf{W}_{lhn}) = \frac{1}{(2\pi)^{M/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2} (\mathbf{w} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{w} - \boldsymbol{\mu})\right) \tag{7}$$

where $M$ is the number of entries of $\mathbf{w}$ and only the
diagonal entries of the covariance matrix $\boldsymbol{\Sigma}$ are
non-zero (from the independence assumption on the weights).

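With $N$ adapted speaker weight vectors, the maximum likelihood
estimates of the mean $\boldsymbol{\mu}$ and covariance
$\boldsymbol{\Sigma}$ can be expressed as:

$$\boldsymbol{\mu}_{ML} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{w}_i \tag{8}$$

$$\boldsymbol{\Sigma}_{ML} = \frac{1}{N} \sum_{i=1}^{N} (\mathbf{w}_i - \boldsymbol{\mu}_{ML})(\mathbf{w}_i - \boldsymbol{\mu}_{ML})^T \tag{9}$$

where $\mathbf{w}_i$ is the vector consisting of the adapted
transformation weights of speaker $i$.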
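
As a rough illustration of Eqs. (8) and (9) under the diagonal-covariance assumption, the following sketch estimates the prior mean and the per-weight variances from a handful of flattened, speaker-adapted LHN weight vectors. The function name and the toy vectors are hypothetical; a real prior would be estimated from one adapted LHN per training speaker.

```python
def estimate_diagonal_prior(speaker_weights):
    """ML mean and per-weight variance (diagonal of Sigma) over N speakers.

    speaker_weights: list of N flattened LHN weight vectors of equal length.
    """
    n = len(speaker_weights)
    dim = len(speaker_weights[0])
    mean = [sum(w[k] for w in speaker_weights) / n for k in range(dim)]
    # Only the diagonal of Eq. (9) is kept, since the weights are assumed
    # to be independent.
    var = [sum((w[k] - mean[k]) ** 2 for w in speaker_weights) / n
           for k in range(dim)]
    return mean, var

# Toy data (assumed): three "speakers", three LHN weights each.
speaker_weights = [[1.0, 0.0, 0.2], [0.8, 0.1, 0.0], [1.2, -0.1, 0.1]]
mean, var = estimate_diagonal_prior(speaker_weights)
```

Weights that vary little across speakers receive a small variance and are therefore held close to the prior mean during MAP adaptation.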

4.2. MAP Formulation 

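MAP adaptation is then formulated as follows. Eq. (10)
expresses the MAP learning idea by adding the prior
density term $p(\mathbf{W}_{lhn})$ to the plain cross-entropy
objective function:

$$\mathcal{L}_{MAP} = -\lambda \log p(\mathbf{W}_{lhn}) + \mathcal{L}_{xent} \tag{10}$$

Applying the prior of Eq. (7) and dropping the terms that do not
depend on $\mathbf{w}$, the objective function for
MAP LHN adaptation takes the form of Eq. (11):

$$\mathcal{L}_{MAP} = \frac{\lambda}{2} (\mathbf{w} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{w} - \boldsymbol{\mu}) + \mathcal{L}_{xent} \tag{11}$$

where $\boldsymbol{\Sigma}$ is the diagonal covariance matrix of
Eq. (7).

A close look at Eq. (11) shows that when the prior density is a
standard Gaussian $\mathcal{N}(\mathbf{0}, \mathbf{I})$, MAP
learning degenerates to conventional L2-regularized training.
The gradient of $\mathcal{L}_{MAP}$ with respect to $\mathbf{w}$
can now be expressed as:

$$\frac{\partial \mathcal{L}_{MAP}}{\partial \mathbf{w}} = \lambda (\mathbf{w} - \boldsymbol{\mu}) \odot \mathrm{diag}(\boldsymbol{\Sigma}^{-1}) + \frac{\partial \mathcal{L}_{xent}}{\partial \mathbf{w}}, \tag{12}$$

where $\mathrm{diag}(\boldsymbol{\Sigma}^{-1})$ consists of the
diagonal entries of $\boldsymbol{\Sigma}^{-1}$ and $\odot$ denotes
element-wise multiplication.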
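
A minimal sketch of the MAP penalty and its gradient is given below, assuming the penalty is (lambda / 2) (w - mu)^T Sigma^{-1} (w - mu) with a diagonal Sigma. The cross-entropy gradient is passed in as a plain list, since computing it requires back-propagation through the network; the function names, the learning rate, and the toy numbers are assumptions made for illustration only.

```python
def map_penalty(w, mu, var, lam):
    """(lam / 2) * (w - mu)^T Sigma^{-1} (w - mu) for a diagonal Sigma."""
    return 0.5 * lam * sum((wk - mk) ** 2 / vk
                           for wk, mk, vk in zip(w, mu, var))

def map_penalty_grad(w, mu, var, lam):
    """Gradient of the penalty: lam * (w - mu) scaled element-wise by 1/var."""
    return [lam * (wk - mk) / vk for wk, mk, vk in zip(w, mu, var)]

def map_sgd_step(w, xent_grad, mu, var, lam, lr):
    """One SGD step on the MAP objective (data gradient plus prior gradient)."""
    prior_grad = map_penalty_grad(w, mu, var, lam)
    return [wk - lr * (g + p) for wk, g, p in zip(w, xent_grad, prior_grad)]

# Toy values (assumed): three LHN weights with a prior from training speakers.
w = [1.1, -0.2, 0.3]
mu = [1.0, 0.0, 0.0]
var = [0.02, 0.01, 0.04]

# With a zero data gradient, the step pulls every weight towards the prior mean.
w_next = map_sgd_step(w, [0.0, 0.0, 0.0], mu, var, lam=0.1, lr=0.01)

# With a standard Gaussian prior the penalty gradient is plain L2 weight decay.
l2_grad = map_penalty_grad(w, [0.0] * 3, [1.0] * 3, lam=0.1)
```

Dividing by the per-weight variance means that weights which were stable across training speakers are regularised more strongly than weights that varied widely.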

5. Experiments 

5.1. Experimental Setup 

This study is concerned with the problem of speaker adaptation,
and experiments are reported on the 20k-word open vocabulary
Wall Street Journal task [35] using the Kaldi toolkit [36]. The
baseline CD-DNN-HMM system was trained using the WSJ0
material (SI-84). The standard adaptation set of WSJ0 (si_et_ad,
8 speakers, 40 sentences per speaker) was used to
adapt the affine transformation added to the
speaker-independent DNN. The standard open vocabulary 20,000-word
(20K) read NVP Sennheiser microphone data (si_et_20, 8 speakers
x 40 sentences) were used for evaluation. A standard
trigram language model was adopted during decoding. ASR
performance is reported in terms of the word error rate (WER).

The DNN has six hidden layers. The first five hidden layers
have 2048 units each, whereas the last hidden layer has 216 units.
The output layer has 2022 softmax units. This DNN
architecture follows conventional configurations used in the speech
community except for the last hidden layer, which acts as a
bottleneck layer. This configuration was chosen because
too large a dimension of the last non-linear hidden layer might
be harmful for LHN adaptation. Bottleneck based
low-rank methods have been widely used to obtain more
compact DNN models with equivalent performance [23][24][25][26].
The number of units, 216, was chosen to simulate a sort
of three-state phone layer, thereby obtaining a kind of
hierarchical structure between mono-phones in the hidden layer and
senones at the output layer. The input feature vector is a
23-dimensional mean-normalized log filter-bank feature with up to
second-order derivatives and a context window of 11 frames,
forming a 759-dimensional (69 x 11) input vector. The DNN
was trained with an initial learning rate of 0.008 using the
cross-entropy objective function. It was initialised with stacked
restricted Boltzmann machines using layer-by-layer
generative pre-training.
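
The dimensions quoted above can be checked with a few lines of arithmetic. The sketch below only restates the numbers given in this section; the variable names are hypothetical, and counting one bias per unit is an assumption based on the affine form of the layers.

```python
# Input layer: 23 log filter-bank coefficients, with first- and second-order
# derivatives (3 streams), spliced over a context window of 11 frames.
n_filterbank = 23
n_streams = 3
context_frames = 11
input_dim = n_filterbank * n_streams * context_frames

# Bottleneck (last hidden layer) and output layer sizes from the text.
bottleneck = 216
n_senones = 2022

# Parameters adapted by an LHN on the bottleneck: a square matrix plus a bias.
lhn_params = bottleneck * bottleneck + bottleneck

# Parameters that direct adaptation of the output layer would have to update.
output_layer_params = n_senones * bottleneck + n_senones

print(input_dim, lhn_params, output_layer_params)
```

Under these assumptions the LHN carries 46,872 adapted parameters, roughly a tenth of the 438,774 in the output layer, which is the practical motivation for placing the transformation on a bottleneck.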

5.2. Experimental Results 

The word error rates (WERs) attained with different adaptation
techniques are reported in Table 1. All available adaptation
material, namely 40 sentences per speaker, was used for
adaptation. The term BASELINE refers to the speaker-independent
CD-DNN-HMM system. LIN, LIN-KLD, and MAP
LIN refer to adaptation based on the standard
linear input network approach, the KLD regularisation technique¹
in combination with LIN, and maximum a posteriori
transformation based adaptation with a prior defined over the LIN
parameters, respectively. The terms LON and LON-KLD
denote, with a little abuse of terminology, the direct
adaptation of the output layer weight matrix without and with
KLD, respectively. LHN adaptation results are also reported
along with the corresponding MAP version, MAP LHN, which
is the adaptation approach proposed in this paper. Since the LHN
was inserted between the last hidden layer and the output layer
weight matrix, its dimension is 216 x 216. The LHN is initialised
to an identity matrix with zero bias, which gives a starting point
equivalent to the unadapted model. Supervised adaptation is
then performed by updating only the LHN parameters. MAP
adaptation was performed as described in Section 4. For the sake
of comparison, LHN-KLD, which denotes standard LHN combined with
KLD, was also evaluated.

LIN and LHN both outperform LON, which attains the
smallest improvement. KLD always improves over the plain
affine transformation based adaptation techniques, as expected.
Furthermore, the proposed MAP LHN outperforms all other
techniques in the given task, and it attains the best recognition
result, a relative improvement of 10.4% over the BASELINE.
Finally, we remark that MAP LHN compares favourably against
MAP LIN, which confirms that the introduction of the bottleneck
layer was key to a proper deployment of MAP LHN.

¹ The best KLD results obtained in our laboratories are reported.

Table 1: Comparing WERs on the 20K-word open vocabulary
WSJ0 task for several adaptation approaches using all 40
available adaptation utterances.

System      WER (%)
BASELINE    8.84
LIN         8.22
LIN-KLD     8.06
MAP LIN     8.13
LON         8.80
LON-KLD     8.64
LHN         8.22
LHN-KLD     8.15
MAP LHN     7.92

Table 2: Comparing LHN and MAP LHN performance (WER in %) with
different amounts of adaptation data.

# Adaptation sentences    Standard LHN    MAP LHN
BASELINE                  8.84
5                         8.59            8.54
10                        8.52            8.52
20                        8.31            8.12
40                        8.22            7.92

Table 2 shows experimental results for LHN and MAP
LHN with different amounts of adaptation sentences, namely
{5, 10, 20, 40}, in the second and third columns, respectively.
These results confirm that MAP LHN adaptation almost always
improves over standard LHN, with the best adaptation result,
a WER of 7.92%, obtained using 40 utterances. With very limited
adaptation data, namely {5, 10} sentences, however, using only a
flat prior in MAP adaptation brings slight or even no
improvement, so we turned to a preliminary investigation of
hierarchical priors to deal with the data scarcity problem.
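
The relative improvements quoted in this section follow directly from the WERs in Table 1. The short sketch below recomputes them; the dictionary and function names are hypothetical, while the WER values are copied from Table 1.

```python
def relative_reduction(baseline_wer, adapted_wer):
    """Relative WER reduction over the baseline, in percent."""
    return 100.0 * (baseline_wer - adapted_wer) / baseline_wer

# WERs (%) from Table 1, all 40 adaptation utterances per speaker.
wer = {
    "LIN": 8.22, "LIN-KLD": 8.06, "MAP LIN": 8.13,
    "LON": 8.80, "LON-KLD": 8.64,
    "LHN": 8.22, "LHN-KLD": 8.15, "MAP LHN": 7.92,
}
baseline = 8.84

relative = {name: relative_reduction(baseline, value)
            for name, value in wer.items()}
best_system = max(relative, key=relative.get)

for name, value in sorted(relative.items(), key=lambda kv: -kv[1]):
    print(f"{name:8s} {value:5.1f}% relative WER reduction")
```

The largest reduction is obtained by MAP LHN, and rounding it to one decimal place reproduces the 10.4% figure reported above.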

5.3. Hierarchical Priors: Preliminary Experiments 

Hierarchical structures, such as trees, have long been used in
the speech community to address over-fitting during
model parameter estimation. For instance, efficient adaptation
with a limited amount of adaptation data was obtained through
the use of a tree data structure to cluster the model parameters
of a CD-GMM-HMM system in [37]. Similar ideas have recently
been explored in DNN learning to enhance classification
performance for classes with few examples in [38], where
hierarchical priors were devised for the output layer weight
matrix (the top-level weights in a DNN) using a tree data
structure that is either fixed or learnt during training.

Top-level DNN weights in a hybrid acoustic model can be
regarded as senone embeddings, and hierarchical priors
can be defined by organising those embeddings in a tree data
structure. Let $\mathbf{W}_{(D+1) \times L}$ denote the output
layer weight matrix (including the bias terms). Each row of
$\mathbf{W}_{(D+1) \times L}$ corresponds to a senone embedding.
Specifically, the $s$-th senone embedding, denoted
$\mathbf{w}_s$, is the $s$-th row of
$\mathbf{W}_{(D+1) \times L}$. The tree structure used to generate
hierarchical priors can be either learnt during training or given.
Here, we used the fixed two-layer tree shown in Figure 2: there
are $L$ leaf nodes, each corresponding to a senone embedding,
and $S$ parent nodes, each clustering together similar leaf nodes.
Each parent node clusters the senone embeddings sharing the same
central phone-state; therefore, $S$ is equal to 130 in this work.
Hierarchical priors can now be established by associating a
vector $\mathbf{w}_s$ with each leaf node and a vector
$\boldsymbol{\theta}_s$ with each parent node,
and imposing Gaussian probability densities over
these two vectors as follows:
$\boldsymbol{\theta}_s \sim \mathcal{N}(\mathbf{0}, \frac{1}{\lambda_2} \mathbf{I}_{D+1})$ and
$\mathbf{w}_s \sim \mathcal{N}(\boldsymbol{\theta}_s, \frac{1}{\lambda_1} \mathbf{I}_{D+1})$.

Figure 2: Fixed two-layer tree for hierarchical priors generation.
Each leaf node represents a senone embedding, and it is thereby
a row of $\mathbf{W}_{(D+1) \times L}$.

The objective function with hierarchical priors takes the
form of Eq. (13):

$$\mathcal{L}_{hp} = \mathcal{L}_{xent} + \frac{\lambda_1}{2} \sum_{s} \|\mathbf{w}_s - \boldsymbol{\theta}_s\|^2 + \frac{\lambda_2}{2} \|\boldsymbol{\theta}\|^2. \tag{13}$$

It can be verified that, for fixed DNN weights, minimizing
Eq. (13) over $\boldsymbol{\theta}_s$ yields a scaled average of
all leaf vectors $\mathbf{w}_s$ attached to the corresponding
parent node. We focus our attention on
experimental results with very small amounts of adaptation data,
as shown in Table 3. With limited adaptation data, namely 5 and
10 utterances, small performance improvements over flat priors
are observed when adaptation is carried out with
hierarchical priors. Although the current improvement is still
quite small, we believe more sophisticated trees can be adopted
for better performance in future studies.

Table 3: WER (%) comparisons of flat and hierarchical priors for
MAP LHN with a small amount of adaptation utterances.

# Adaptation sentences    MAP LHN, flat priors    MAP LHN, hierarchical priors
5                         8.54                    8.48
10                        8.52                    8.45
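
The "scaled average" property can be checked numerically on a toy tree. The sketch below assumes the prior part of the objective for one parent node is (lam1 / 2) * sum ||w_s - theta||^2 + (lam2 / 2) * ||theta||^2, where lam1 and lam2 are assumed names for the two prior precisions; the function names and the toy embeddings are likewise hypothetical.

```python
def parent_vector(children, lam1, lam2):
    """Closed-form minimiser of the prior terms over one parent vector.

    Setting the gradient to zero gives theta = sum(w_s) / (n + lam2 / lam1),
    i.e. the children's average scaled by n / (n + lam2 / lam1).
    """
    n = len(children)
    dim = len(children[0])
    scale = 1.0 / (n + lam2 / lam1)
    return [scale * sum(w[k] for w in children) for k in range(dim)]

def hierarchical_penalty(children, theta, lam1, lam2):
    """Prior terms of the objective for one parent and its leaf vectors."""
    leaf_term = sum(sum((wk - tk) ** 2 for wk, tk in zip(w, theta))
                    for w in children)
    parent_term = sum(tk * tk for tk in theta)
    return 0.5 * lam1 * leaf_term + 0.5 * lam2 * parent_term

# Toy tree (assumed): one parent node with two 2-dimensional senone embeddings.
children = [[1.0, 2.0], [3.0, 0.0]]
theta = parent_vector(children, lam1=1.0, lam2=2.0)
best = hierarchical_penalty(children, theta, lam1=1.0, lam2=2.0)
```

With these toy values the plain average of the two embeddings is [2.0, 1.0], while the minimiser is [1.0, 0.5]: the parent vector is shrunk towards zero by the factor n / (n + lam2 / lam1) = 0.5.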

6. Conclusion 

We have investigated a maximum a posteriori (MAP) adaptation
approach for linear hidden networks. The key idea is to treat the
parameters of the augmented affine transformation as
Gaussian random variables and to incorporate prior information
obtained from the training data. Speaker adaptation results show
that the proposed MAP approach leads to a consistent performance
improvement over conventional LHN adaptation. Furthermore,
MAP LHN outperforms the other regularisation schemes evaluated.

A first attempt to use hierarchical priors with a fixed
two-layer tree structure was also studied, and small
improvements were observed in a set of preliminary ASR experiments
using a limited amount of adaptation sentences. Better results
may be hindered by the fixed tree hierarchy
employed in this preliminary work; indeed, it was
demonstrated that learning the tree hierarchy during training
improves classification performance [38]. Finally, from the
objective function perspective, we still rely on cross-entropy.
Other forms of frame-level and sequence-level discriminative
objectives [39][40] can also be applied.